Handling Missing Values when Applying Classification Models
Author
Abstract
Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods (predictive value imputation, the distribution-based imputation used by C4.5, and using reduced models) for applying classification trees to instances with missing values, and also shows evidence that the results generalize to bagged trees and to logistic regression. The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly, the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments.
Keywords: Missing Data, Classification, Classification Trees, Decision Trees, Imputation
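The contrast between imputation and reduced models at prediction time can be illustrated with a minimal sketch. It is not the paper's implementation: a nearest-centroid classifier stands in for the classification trees, column-mean filling stands in for predictive value imputation, and all names (`fit_centroids`, `ReducedModelClassifier`, `predict_with_mean_imputation`) are hypothetical. The cache of per-feature-subset models is only a rough nod to the storage/computation trade-off the abstract mentions.

```python
from statistics import mean


def fit_centroids(X, y, features):
    """Train a nearest-centroid model using only the given feature indices."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(r[f] for r in rows) for f in features]
    return centroids


def predict_centroid(centroids, values):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], values))
    # sorted() makes tie-breaking deterministic
    return min(sorted(centroids), key=dist)


class ReducedModelClassifier:
    """Assumed sketch: one model per pattern of observed features."""

    def __init__(self, X, y):
        self.X, self.y = X, y
        self.cache = {}  # observed-feature tuple -> trained model

    def predict(self, x):
        # Use only the features that are present in this instance.
        features = tuple(i for i, v in enumerate(x) if v is not None)
        if features not in self.cache:
            self.cache[features] = fit_centroids(self.X, self.y, features)
        return predict_centroid(self.cache[features], [x[f] for f in features])


def predict_with_mean_imputation(X, y, x):
    """Fill missing entries with training column means, then use the full model."""
    n = len(X[0])
    col_means = [mean(r[f] for r in X) for f in range(n)]
    filled = [col_means[f] if v is None else v for f, v in enumerate(x)]
    full_model = fit_centroids(X, y, tuple(range(n)))
    return predict_centroid(full_model, filled)


# Toy data (made up for illustration): feature 0 separates the classes.
X = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
y = ["a", "a", "b", "b"]
clf = ReducedModelClassifier(X, y)
reduced_label = clf.predict([9.0, None])          # model trained on feature 0 only
imputed_label = predict_with_mean_imputation(X, y, [9.0, None])
```

On this toy data both treatments agree; the sketch shows only the mechanics, not the accuracy differences reported in the abstract.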
Similar resources
Investigating the missing data effect on credit scoring rule based models: The case of an Iranian bank
Credit risk management is a process in which banks estimate the probability of default (PD) for each loan applicant. Data sets of previous loan applicants are built by gathering their data; these internal data sets are usually completed with data from external credit bureaus and finally used for estimating PD in banks. There is also a continuing interest for banks to use rule based classifiers to b...
DEA with Missing Data: An Interval Data Assignment Approach
In the classical data envelopment analysis (DEA) models, inputs and outputs are assumed to be known variables, and these models cannot deal directly with unknown values of variables. In recent years, there has been little research on handling missing data. This paper suggests a new interval-based approach to handling missing data, which is a modified version of the Kuosmanen (2009) approach. First, the prop...
Decision-Rule Solutions for Data Mining with Missing Values
A method is presented to induce decision rules from data with missing values where (a) the format of the rules is no different than rules for data without missing values and (b) no special features are specified to prepare the original data or to apply the induced rules. This method generates compact Disjunctive Normal Form (DNF) rules. Each class has an equal number of unweighted rules. A ne...
Handling Missing Values when Applying Classification Models
Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods (predictive value imputation, the distribution-based imputation used by C4.5, and using reduced models) for applying classification trees to instances with...
Evaluating Trauma Patients: Addressing Missing Covariates with Joint Optimization
Missing values are a common problem when applying classification algorithms to real-world medical data. This is especially true for trauma patients, where the emergent nature of the cases makes it difficult to collect all of the relevant data for each patient. Standard methods for handling missingness first learn a model to estimate missing data values, and subsequently train and evaluate a cla...